Internet Info 1997 December


Internet Message Format | 1997-02-19 | 6KB

Received: (from daemon@localhost) by services.bunyip.com (8.6.10/8.6.9) id NAA15446 for urn-ietf-out; Fri, 15 Nov 1996 13:42:31 -0500
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1]) by services.bunyip.com (8.6.10/8.6.9) with SMTP id NAA15437 for <urn-ietf@services.bunyip.com>; Fri, 15 Nov 1996 13:42:28 -0500
Received: from josef.ifi.unizh.ch by mocha.bunyip.com with SMTP (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA29572 (mail destined for urn-ietf@services.bunyip.com); Fri, 15 Nov 96 13:41:39 -0500
Received: from ifi.unizh.ch by josef.ifi.unizh.ch id <00924-0@josef.ifi.unizh.ch>; Fri, 15 Nov 1996 19:40:39 +0100
Subject: Re: [URN] Re: I18N does not belong in URNs
To: yergeau@alis.com
Date: Fri, 15 Nov 1996 19:40:38 +0100 (MET)
Cc: dgd@cs.bu.edu, urn-ietf@bunyip.com
In-Reply-To: <2.2.32.19961115155024.007169c0@genstar.alis.ca> from "Francois Yergeau" at Nov 15, 96 10:50:24 am
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 4755
From: Martin J Duerst <mduerst@ifi.unizh.ch>
Message-Id: <"josef.ifi..890:15.10.96.18.40.40"@ifi.unizh.ch>
Sender: owner-urn-ietf@services.bunyip.com
Precedence: bulk
Reply-To: Martin J Duerst <mduerst@ifi.unizh.ch>
Errors-To: owner-urn-ietf@bunyip.com

Francois Yergeau wrote:

>I fail to see why the %-encoded URN should be the reference. This is a
>fallback to the bad old 7-bit days, and results in a needless waste of
>bandwidth and storage resources. Reading the recent report of the IAB
>Character Set Workshop (draft-weider-iab-char-wrkshop-00.txt)

It's nice to have this available finally.

>, I find in
>section 8.2 (Recommendations for new Internet protocols):
>
> "New protocols do not suffer from the need to be compatible
> with old 7-bit pipes. New protocol specifications SHOULD
> use ISO 10646 as the base charset unless there is an
> overriding need to use a different base charset."

That's indeed what we are doing. Pipe width and base charset
are not directly related.

>Elsewhere (3.4.3), UTF-8 is recommended as the encoding and use of escape
>mechanisms is warned against ("...must be weighed very carefully").

This warns against techniques such as SGML &#nnn;. %HH is not on the
character level, it is on the octet level. And it is already well
established for URLs.

>> We can define the standard as %-encoded UTF-8, and if people implement
>>this other ways, they are implementing convenience features in the
>>interface: the software will always have the %-encoded URN available.

Much software will probably do so anyway, despite what the standard
says, and without creating a conflict, because storing and comparing
are more efficient on the 8-bit form.

>As if 8-bit octets on-the-wire were something evil! It is much wiser, IMHO,
>to have the real UTF-8 as the reference value, and have the %-encoding as
>the convenience feature (it must be there anyway for reserved and unsafe
>characters, so there is no risk that an application will not support it).
>If a user needs ASCII-only, let *his* software do the %-encoding for him,
>but let's not force a 9-byte encoding on CJK characters when 3 are enough.
>
>There should be a good reason to burden the whole world forever with
>%-encoding of all 8-bit octets, and I see none at all, except for a visceral
>and unwarranted fear of 8-bit octets.

I think we have to be careful, because there are at least two ways
in which URNs can be transferred/stored:

- In "dedicated" protocols and databases. An example is the header
  of an HTTP request.
- In text. An example is HTML.

For the former, raw 8-bit (i.e. UTF-8) can be used. According to the
standards, HTTP headers are officially limited to ASCII, but in
practice, they will pass 8 bits without problems. (If not, please
don't make a long discussion out of this. It only serves as an
example of (a part of) a protocol that up to now has transmitted
"raw" data without consideration of character set issues.)

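As an illustration of the size argument quoted above (3 octets versus 9 for a CJK character) and of %HH working on octets rather than on characters, here is a minimal Python sketch. The helper name percent_encode_utf8 and the sample character U+65E5 are assumptions chosen for the example; nothing here is taken from the drafts under discussion.

```python
from urllib.parse import quote, unquote_to_bytes


def percent_encode_utf8(text: str) -> str:
    """Encode text as UTF-8, then %HH-escape every octet that is not
    an unreserved ASCII character (hypothetical helper)."""
    return quote(text.encode("utf-8"), safe="")


ch = "\u65e5"                       # one CJK character
raw = ch.encode("utf-8")            # the raw 8-bit form: 3 octets
escaped = percent_encode_utf8(ch)   # the %-encoded form: 9 ASCII characters

print(len(raw), raw)
print(len(escaped), escaped)

# The escaping is reversible octet by octet, without knowing
# anything about the characters the octets stand for.
assert unquote_to_bytes(escaped) == raw
```

Both forms carry the same three octets; the difference is only in how they are written down, which is why software can store and compare the 8-bit form internally without conflicting with a %-encoded reference form.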
For the latter, as Francois probably knows even better than I do
from his work on URL internationalization, putting an URN with 10646
characters into an HTML document written in iso-8859-1 in raw 8-bit
form will produce bad results. Without extremely clever tool support,
it will neither be possible to input such an URN, nor will an URN
show up with the characters it represents. Transcoding, as well as
other operations such as cut-and-paste, will also not do what
everybody would hope for.

Just saying "use 8 bits, use 8 bits" could however give the
impression to some implementors that the UTF-8 8-bit octets should
appear as such in an HTML document in iso-8859-1.

Whatever we make the "standard" or "base" form, or whether we have
such a form or not, we should therefore clearly say that URNs

- Can be transmitted/stored in 8-bit form in protocols/databases
  that accommodate URNs as such, and not as part of text and/or
  associated with character encoding information.
- Have to be interpreted and treated as characters when transmitted
  as part of an encoded text with (explicitly or implicitly)
  associated character encoding information. Those characters that
  cannot be represented in the chosen encoding, as well as %HH
  sequences that do not form valid UTF-8 sequences (and of course
  reserved characters) have to stay in %HH form.

I know that the last point may again frighten some of you. It seems
to introduce a new representation. But if you think about URLs in
EBCDIC, you will see that it is nothing new.

Personally, I think that the second paragraph above could be amended
with a sentence saying that to avoid possible misinterpretations due
to lack of appropriate information about character encoding, and to
make the URN transcribable to the widest audience, full %HH encoding
can/should be chosen. We may have to discuss how strong this wording
should be.
But we definitely have to include something that avoids
misunderstandings so that raw 8-bit UTF-8 will never turn up as such
in e.g. iso-8859-1 documents.

Regards, Martin.
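The rule proposed above for URNs embedded in text can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name display_form and the "urn:x:" examples are hypothetical, every ASCII character that arrives escaped is assumed to be reserved or unsafe and therefore stays escaped, and a run of %HH octets that is not valid UTF-8 as a whole is left untouched (a simplification).

```python
import re
from urllib.parse import unquote_to_bytes

# One or more consecutive %HH escapes.
_ESCAPED_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _escape(ch: str) -> str:
    """%HH-escape every UTF-8 octet of a single character."""
    return "".join("%{:02X}".format(b) for b in ch.encode("utf-8"))


def _representable(ch: str, charset: str) -> bool:
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def display_form(escaped_urn: str, charset: str) -> str:
    """Hypothetical helper: rewrite a fully %-encoded URN for use in a
    text whose character encoding is `charset`."""

    def convert(match):
        run = match.group(0)
        try:
            chars = unquote_to_bytes(run).decode("utf-8")
        except UnicodeDecodeError:
            return run  # not valid UTF-8: stays in %HH form
        out = []
        for ch in chars:
            # Assumption: escaped ASCII was escaped for a reason
            # (reserved or unsafe), so it is never shown raw.
            if ord(ch) < 0x80 or not _representable(ch, charset):
                out.append(_escape(ch))
            else:
                out.append(ch)  # a character, to be encoded in `charset`
        return "".join(out)

    return _ESCAPED_RUN.sub(convert, escaped_urn)


# e-acute exists in iso-8859-1, so it can appear as a character there;
# the CJK character does not, so it stays in %HH form.
print(display_form("urn:x:caf%C3%A9", "iso-8859-1"))
print(display_form("urn:x:%E6%97%A5", "iso-8859-1"))
```

The result is a string of characters, which is then written out in the document's own encoding; the raw UTF-8 octets never appear as such in an iso-8859-1 text.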